
Abdu Samaraie Final project Preview
DSA501 Summer 2022
References:
Runestone Academy HTTLADS
Kaggle.com
Using the power of Exploratory Data Analysis to squeeze every ounce of insight from the data and drive business growth.
Online orders made it easy to collect user data that we did not have access to before. Like who would of known what is the customer first instinct when buying from the store? But with online orders, we do now have this information in our hand.
Instacart is a grocery ordering and delivery app with over 500 Million products and 40000 stores serves across U.S. & Canada. Instacart provides a user experience where you will get product recommendation based on your previous orders.
Back in 2017, the company announced its first public dataset release, which is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.
The goal of this project is to find which products customer tend to buy first when they start a new order online? Does their orders consist mostly of vegan/vegitarian/health foods? Answering those questions will give us valiable insights on customer buying habits.
Step 1: Download the data
The data for the final project can be obtained from Kaggle as well.
# Importing libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
import plotly.offline as pyo
import plotly.graph_objs as go
import pandas as pd
import numpy as np
# To displays frames side by side
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
# Disable warnings
import warnings
warnings.filterwarnings('ignore')
Step 2: explore the datasets and features:
!pip install kaggle
Collecting kaggle Using cached kaggle-1.5.16-py3-none-any.whl Requirement already satisfied: bleach in p:\anaconda\lib\site-packages (from kaggle) (4.0.0) Requirement already satisfied: certifi in p:\anaconda\lib\site-packages (from kaggle) (2021.10.8) Requirement already satisfied: python-slugify in p:\anaconda\lib\site-packages (from kaggle) (5.0.2) Requirement already satisfied: python-dateutil in p:\anaconda\lib\site-packages (from kaggle) (2.8.2) Requirement already satisfied: six>=1.10 in p:\anaconda\lib\site-packages (from kaggle) (1.16.0) Requirement already satisfied: tqdm in p:\anaconda\lib\site-packages (from kaggle) (4.62.3) Requirement already satisfied: urllib3 in p:\anaconda\lib\site-packages (from kaggle) (1.26.7) Requirement already satisfied: requests in p:\anaconda\lib\site-packages (from kaggle) (2.31.0) Requirement already satisfied: packaging in p:\anaconda\lib\site-packages (from bleach->kaggle) (21.0) Requirement already satisfied: webencodings in p:\anaconda\lib\site-packages (from bleach->kaggle) (0.5.1) Requirement already satisfied: pyparsing>=2.0.2 in p:\anaconda\lib\site-packages (from packaging->bleach->kaggle) (3.0.4) Requirement already satisfied: text-unidecode>=1.3 in p:\anaconda\lib\site-packages (from python-slugify->kaggle) (1.3) Requirement already satisfied: charset-normalizer<4,>=2 in p:\anaconda\lib\site-packages (from requests->kaggle) (2.0.4) Requirement already satisfied: idna<4,>=2.5 in p:\anaconda\lib\site-packages (from requests->kaggle) (3.2) Requirement already satisfied: colorama in p:\anaconda\lib\site-packages (from tqdm->kaggle) (0.4.4) Installing collected packages: kaggle Successfully installed kaggle-1.5.16
%%time
# Install needed libraries
import json
import sys
sys.path.append("./content/")
from content.token import token
# read the data
# read the json file from the current directory dump the token into it
with open('./.kaggle/kaggle.json', 'w') as file:
json.dump(token, file)
# Get a list of the datasets in kaggle with "instacart" in its name
!kaggle datasets list -s instacart
ref title size lastUpdated downloadCount voteCount usabilityRating ------------------------------------------------------------- ------------------------------------------------ ----- ------------------- ------------- --------- --------------- psparks/instacart-market-basket-analysis Instacart Market Basket Analysis 197MB 2017-11-20 03:08:26 11956 149 0.4117647 yasserh/instacart-online-grocery-basket-analysis-dataset InstaCart Online Grocery Basket Analysis Dataset 197MB 2022-01-25 16:43:37 3135 49 1.0 viswajithkn/instacart-predict-shopping-time Instacart Predict Shopping Time 12MB 2018-10-29 17:24:48 586 12 0.64705884 brendanartley/simplifiedinstacartdata Simplified Instacart Market Basket Dataset 184MB 2022-04-03 14:53:59 330 6 0.8235294 dhruvshan/instacart-stock-data-all-time Instacart Stock Data (All Time) 2KB 2023-12-19 07:51:37 1 3 0.9411765 samwhitehill/instacart-training-sample InstaCart training sample 1MB 2017-07-01 20:24:11 206 2 0.5882353 samwhitehill/instacart-sample-labels Instacart sample labels 9KB 2017-07-01 20:26:16 41 1 0.4117647 rajkumarpandey02/complete-list-of-unicorn-companies Complete List Of Unicorn Companies 40KB 2022-11-15 09:57:49 146 20 0.9411765 nguynlonghi/instacart-100k-real Instacart-100k-real 20MB 2023-07-07 15:37:12 6 1 0.29411766 streptkinase/instacart-data-cache Instacart_data_cache 409MB 2020-07-31 05:35:52 24 1 0.1764706 rajashrideka/instacart-analysis-dataset Instacart_Analysis_Dataset 238MB 2022-07-22 19:28:26 30 1 0.1764706 thedevastator/new-whole-foods-on-sale-product-data-collection Whole Foods Products Data 3MB 2022-11-17 00:12:49 1337 34 0.9411765 emmettdoolan/instacart-mba Instacart_MBA 198MB 2020-04-08 13:46:01 76 1 0.11764706 tungbinhthuong/instacart-train-test instacart_train_test 303MB 2022-07-16 06:18:45 10 1 0.11764706 gyanendrachaubey/next-purchase-date-prediction Next Purchase Date Prediction 4MB 2023-02-22 05:52:10 25 1 0.5882353 despoinapatsavoudi/instacat2assignment1 instacat-2-assignment.1 4KB 2021-11-21 03:26:28 10 2 0.0 streptkinase/user-features-instacart user_features_instacart 4MB 2020-08-06 05:56:24 20 1 0.11764706 anhgereng/fender-rs Fender_RS 281MB 2023-11-20 16:03:30 0 0 0.25 nguynlonghi/training-phase-instacart Training_phase_instacart 33MB 2023-07-07 00:28:50 5 0 0.11764706 Wall time: 720 ms
url = './content/'
files=['aisles','departments','order_products__prior','order_products__train','orders','products']
for file in files:
# Create a Pandas DataFrame for each dataset
globals()[file]=pd.read_csv(url+file+'.csv')
# asign the data to a variable
orders = pd.read_csv('./content/orders.csv')
aisles = pd.read_csv('./content/aisles.csv')
departments = pd.read_csv('./content/departments.csv')
products = pd.read_csv('./content/products.csv')
order_products__prior = pd.read_csv('./content/order_products__prior.csv')
order_products__train = pd.read_csv('./content/order_products__train.csv')
display(files[0],files[1],files[2])
aisles
| aisle_id | aisle | |
|---|---|---|
| 0 | 1 | prepared soups salads |
| 1 | 2 | specialty cheeses |
| 2 | 3 | energy granola bars |
| 3 | 4 | instant foods |
| 4 | 5 | marinades meat preparation |
| ... | ... | ... |
| 129 | 130 | hot cereal pancake mixes |
| 130 | 131 | dry pasta |
| 131 | 132 | beauty |
| 132 | 133 | muscles joints pain relief |
| 133 | 134 | specialty wines champagnes |
134 rows × 2 columns
departments
| department_id | department | |
|---|---|---|
| 0 | 1 | frozen |
| 1 | 2 | other |
| 2 | 3 | bakery |
| 3 | 4 | produce |
| 4 | 5 | alcohol |
| 5 | 6 | international |
| 6 | 7 | beverages |
| 7 | 8 | pets |
| 8 | 9 | dry goods pasta |
| 9 | 10 | bulk |
| 10 | 11 | personal care |
| 11 | 12 | meat seafood |
| 12 | 13 | pantry |
| 13 | 14 | breakfast |
| 14 | 15 | canned goods |
| 15 | 16 | dairy eggs |
| 16 | 17 | household |
| 17 | 18 | babies |
| 18 | 19 | snacks |
| 19 | 20 | deli |
| 20 | 21 | missing |
order_products__prior
| order_id | product_id | add_to_cart_order | reordered | |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |
| ... | ... | ... | ... | ... |
| 32434484 | 3421083 | 39678 | 6 | 1 |
| 32434485 | 3421083 | 11352 | 7 | 0 |
| 32434486 | 3421083 | 4600 | 8 | 0 |
| 32434487 | 3421083 | 24852 | 9 | 1 |
| 32434488 | 3421083 | 5020 | 10 | 1 |
32434489 rows × 4 columns
display(files[3],files[4],files[5])
order_products__train
| order_id | product_id | add_to_cart_order | reordered | |
|---|---|---|---|---|
| 0 | 1 | 49302 | 1 | 1 |
| 1 | 1 | 11109 | 2 | 1 |
| 2 | 1 | 10246 | 3 | 0 |
| 3 | 1 | 49683 | 4 | 0 |
| 4 | 1 | 43633 | 5 | 1 |
| ... | ... | ... | ... | ... |
| 1384612 | 3421063 | 14233 | 3 | 1 |
| 1384613 | 3421063 | 35548 | 4 | 1 |
| 1384614 | 3421070 | 35951 | 1 | 1 |
| 1384615 | 3421070 | 16953 | 2 | 1 |
| 1384616 | 3421070 | 4724 | 3 | 1 |
1384617 rows × 4 columns
orders
| order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3421078 | 2266710 | 206209 | prior | 10 | 5 | 18 | 29.0 |
| 3421079 | 1854736 | 206209 | prior | 11 | 4 | 10 | 30.0 |
| 3421080 | 626363 | 206209 | prior | 12 | 1 | 12 | 18.0 |
| 3421081 | 2977660 | 206209 | prior | 13 | 1 | 12 | 7.0 |
| 3421082 | 272231 | 206209 | train | 14 | 6 | 14 | 30.0 |
3421083 rows × 7 columns
products
| product_id | product_name | aisle_id | department_id | |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |
| ... | ... | ... | ... | ... |
| 49683 | 49684 | Vodka, Triple Distilled, Twist of Vanilla | 124 | 5 |
| 49684 | 49685 | En Croute Roast Hazelnut Cranberry | 42 | 1 |
| 49685 | 49686 | Artisan Baguette | 112 | 3 |
| 49686 | 49687 | Smartblend Healthy Metabolism Dry Cat Food | 41 | 8 |
| 49687 | 49688 | Fresh Foaming Cleanser | 73 | 11 |
49688 rows × 4 columns
Aisles = Aisle number with description of the aisles in the store
Departments = departments name in the store (description of items of the department
order_products__prior = what each customer bought previously
order_products__train = last thing each customer bought
orders = what each customer bought
products = all the products available in the store
aisles (features):
departments (features):
order_products__prior (features):
order_products_train (features):
orders (features):
products (features):
orders.user_id.value_counts().loc[1]
11
orders[orders.user_id==1]
| order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
| 5 | 3367565 | 1 | prior | 6 | 2 | 7 | 19.0 |
| 6 | 550135 | 1 | prior | 7 | 1 | 9 | 20.0 |
| 7 | 3108588 | 1 | prior | 8 | 1 | 14 | 14.0 |
| 8 | 2295261 | 1 | prior | 9 | 1 | 16 | 0.0 |
| 9 | 2550362 | 1 | prior | 10 | 4 | 8 | 30.0 |
| 10 | 1187899 | 1 | train | 11 | 4 | 8 | 14.0 |
orders.columns
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
'order_hour_of_day', 'days_since_prior_order'],
dtype='object')
if len(orders) == len(orders.order_id.unique()):
print("order id is unique")
order id is unique
Answer: each order appears only once. They are unique in that sense.
len(orders)==len(orders.user_id.unique())
False
orders.user_id.value_counts()
152340 100
185641 100
185524 100
81678 100
70922 100
...
13046 4
70208 4
111440 4
31762 4
47980 4
Name: user_id, Length: 206209, dtype: int64
f_dict={}
for file in files:
f_dict[file]=[globals()[file].columns.values]
df=pd.DataFrame(f_dict).T
df
| 0 | |
|---|---|
| aisles | [aisle_id, aisle] |
| departments | [department_id, department] |
| order_products__prior | [order_id, product_id, add_to_cart_order, reor... |
| order_products__train | [order_id, product_id, add_to_cart_order, reor... |
| orders | [order_id, user_id, eval_set, order_number, or... |
| products | [product_id, product_name, aisle_id, departmen... |
pd. set_option('display.max_columns', None)
pd. set_option('display.max_colwidth', None)
df
| 0 | |
|---|---|
| aisles | [aisle_id, aisle] |
| departments | [department_id, department] |
| order_products__prior | [order_id, product_id, add_to_cart_order, reordered] |
| order_products__train | [order_id, product_id, add_to_cart_order, reordered] |
| orders | [order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order] |
| products | [product_id, product_name, aisle_id, department_id] |
df5 = pd.merge(orders,order_products__train)[['user_id','order_id','product_id']]
df5
| user_id | order_id | product_id | |
|---|---|---|---|
| 0 | 1 | 1187899 | 196 |
| 1 | 1 | 1187899 | 25133 |
| 2 | 1 | 1187899 | 38928 |
| 3 | 1 | 1187899 | 26405 |
| 4 | 1 | 1187899 | 39657 |
| ... | ... | ... | ... |
| 1384612 | 206209 | 272231 | 40603 |
| 1384613 | 206209 | 272231 | 15655 |
| 1384614 | 206209 | 272231 | 42606 |
| 1384615 | 206209 | 272231 | 37966 |
| 1384616 | 206209 | 272231 | 39216 |
1384617 rows × 3 columns
df51 = df5[df5.order_id==1187899]
df51a = pd.merge(df51,products)[['user_id','product_name']]
df51a
| user_id | product_name | |
|---|---|---|
| 0 | 1 | Soda |
| 1 | 1 | Organic String Cheese |
| 2 | 1 | 0% Greek Strained Yogurt |
| 3 | 1 | XL Pick-A-Size Paper Towel Rolls |
| 4 | 1 | Milk Chocolate Almonds |
| 5 | 1 | Pistachios |
| 6 | 1 | Cinnamon Toast Crunch |
| 7 | 1 | Aged White Cheddar Popcorn |
| 8 | 1 | Organic Whole Milk |
| 9 | 1 | Organic Half & Half |
| 10 | 1 | Zero Calorie Cola |
ax=order_products__prior.product_id.value_counts()
ax.index
Int64Index([24852, 13176, 21137, 21903, 47209, 47766, 47626, 16797, 26209,
27845,
...
1002, 22747, 34331, 41967, 3117, 14756, 20264, 31254, 13397,
23624],
dtype='int64', length=49677)
products[products.product_id==24852]
| product_id | product_name | aisle_id | department_id | |
|---|---|---|---|---|
| 24851 | 24852 | Banana | 24 | 4 |
Answer: banana
Question: What is the most popular second item?
products[products.product_id==13176]
| product_id | product_name | aisle_id | department_id | |
|---|---|---|---|---|
| 13175 | 13176 | Bag of Organic Bananas | 24 | 4 |
productOrdered = pd.merge(products,order_products__prior, on="product_id")
deptdf = pd.merge(departments,productOrdered, on='department_id')
df = deptdf.department.value_counts().sort_values(ascending=False).reset_index().rename(columns={'index': 'department', 'department':'count'} )
data = [go.Bar(
x=df['department'],
y=df['count']
)]
layout = go.Layout(
title='Most Shopped Departments by Product Purchased',
yaxis_title = "Number of items bought",
xaxis_title = "Departments"
)
fig = go.Figure(data=data, layout=layout)
fig.update_xaxes(tickangle = 45)
fig.show()
# produce from the chart above
plt.figure(figsize=(16,8))
ax = sns.countplot(x='order_dow',data=orders)
plt.ylabel('Number of Orders', fontsize=16)
plt.xlabel('Day of Week (Sunday=0,Monday=1 etc.)', fontsize=16)
plt.show(ax)
What can you understand from the customer first add to cart item?¶
Knowing which product graps the attintion of the customer first is really important. It can help us retarget the customer with a marketing campagin to recommend a similar products he is definitly intrested in.
Solution: Products placed first in cart are the products mostly reordered. We can invest in this information and start recommeding product that are similar to the most reordered product.
Note: This is called Cross-selling or up-selling in marketing terms.
# first compine order_product_train and prior
df1 = pd.concat([order_products__prior, order_products__train], # Append two pandas DataFrames
ignore_index = True,
sort = False)
# get only the products that has been added to cart first and was reordered
df2 = df1[(df1['add_to_cart_order'] == 1) & (df1['reordered'] == 1)]
# group it by the product that was frequently reordered the most
df3 = df2.groupby(['product_id'])['reordered'].agg(['sum'])
df4 = df3.merge(products, on='product_id').sort_values('sum',ascending=False)
df4
| product_id | sum | product_name | aisle_id | department_id | |
|---|---|---|---|---|---|
| 16480 | 24852 | 102269 | Banana | 24 | 4 |
| 8624 | 13176 | 72651 | Bag of Organic Bananas | 24 | 4 |
| 18447 | 27845 | 27781 | Organic Whole Milk | 84 | 16 |
| 13961 | 21137 | 23022 | Organic Strawberries | 24 | 4 |
| 31424 | 47209 | 20389 | Organic Hass Avocado | 24 | 4 |
| ... | ... | ... | ... | ... | ... |
| 14612 | 22087 | 1 | Whole Celery Seed | 104 | 13 |
| 14611 | 22086 | 1 | Fire Roasted Diced Tomatoes With Garlic | 81 | 15 |
| 14610 | 22084 | 1 | Chocolate Ice Cream in Wafer cups | 37 | 1 |
| 7295 | 11113 | 1 | 9 Grain Bread | 112 | 3 |
| 19768 | 29839 | 1 | Superior Preference Fade-Defying Color + Shine System - 4 Dark Brown | 22 | 11 |
33065 rows × 5 columns
data = [go.Bar(
x=df4['product_name'].head(20),
y=df4['sum']
)]
layout = go.Layout(
title='Which product was purchased first most of the time by all customers',
yaxis_title = "Number of reorders",
xaxis_title = "Product Name"
)
fig = go.Figure(data=data, layout=layout)
fig.update_xaxes(tickangle = 45)
fig.show()
Doing this Exploratory Data Analysis allows me to understand in precise details about the customer shopping behavior on the Instacart platform. Knowing which items are first purchased is the first step for Instacart to optimize its software product and recommend items for customers while they shop.